This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.
You can find the complete handbook on Github
This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.
HOW TO READ: To emphasize the tidyverse coding approach, each cleaning step is explained individually and then incorporated into a “cleaning pipeline” - a series of cleaning actions linked together sequentially through pipes (LINK TO PIPES). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).
The cleaning steps demonstrated include:
case_when())replace missing with dealing with cases (all lower, etc) case_when() factors
Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)
You can view the original raw dataset below:
Text here about a cleaning pipeline of dplyr verbs… order is important
Variable names are used so often, it is best that they have “clean” syntax. We suggest the following:
The names of linelist_raw are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).
names(linelist_raw)
## [1] "row_num" "case_id" "generation" "infection date"
## [5] "date onset" "hosp date" "date_of_outcome" "outcome"
## [9] "gender" "hospital" "lon" "lat"
## [13] "infector" "source" "age" "age_unit"
## [17] "fever" "chills" "cough" "aches"
## [21] "vomit"Note: To use a variable names that include spaces, surround the name with back-ticks, for example: linelist$`infection date`
On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).
The function clean_names() from the package janitor standardizes variable names by transliterating to unique ASCII names by doing the following:
case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)replace = argument (e.g. replace = c(onset = “date_of_onset”))# send the dataset through the function clean_names()
linelist <- linelist_raw %>%
janitor::clean_names()
# see the new names
names(linelist)
## [1] "row_num" "case_id" "generation" "infection_date"
## [5] "date_onset" "hosp_date" "date_of_outcome" "outcome"
## [9] "gender" "hospital" "lon" "lat"
## [13] "infector" "source" "age" "age_unit"
## [17] "fever" "chills" "cough" "aches"
## [21] "vomit"Re-naming variables manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”, the new variable name is given before the old variable name.
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
# standardize variable name syntax
janitor::clean_names() %>%
# manually re-name variables
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome)Now you can see that the variables names have been changed:
CAUTION: This tab may follow from previous tabs.
Often the first step of cleaning data is selecting the variables you want to work with, and to set their order in the dataframe. In a dplyr chain of verbs, this is done with select(). Note that in these examples we modify linelist with select(), but do not assign/overwrite. We just display the resulting new column names, for purpose of example.
Here are all the variable names in the linelist:
names(linelist)
## [1] "row_num" "case_id" "generation"
## [4] "date_infection" "date_onset" "date_hospitalisation"
## [7] "date_outcome" "outcome" "gender"
## [10] "hospital" "lon" "lat"
## [13] "infector" "source" "age"
## [16] "age_unit" "fever" "chills"
## [19] "cough" "aches" "vomit"select() you can do the following:Select only the variables you want to remain, and their order of appearance
# linelist dataset is piped through select() command, and names() prints just the variable names
linelist %>%
select(case_id, date_onset, date_hospitalisation, fever) %>%
names() # display the variable names
## [1] "case_id" "date_onset" "date_hospitalisation"
## [4] "fever"Indicate which variables to remove by placing a minus symbol “-” in front of the variable name (e.g. select(-outcome)), or a vector of variable names (as below). All other variables will be retained.
Inside select() you can use normal operators such as c() to list several variables, : for consecutive variables, ! for opposite, & for AND, and | for OR.
linelist %>%
select(-c(date_onset, fever:vomit)) %>% # remove onset and all symptom variables
names()
## [1] "row_num" "case_id" "generation"
## [4] "date_infection" "date_hospitalisation" "date_outcome"
## [7] "outcome" "gender" "hospital"
## [10] "lon" "lat" "infector"
## [13] "source" "age" "age_unit"Re-order the variables - use everything() to signify all other variables not specified in the select() command:
# move case_id, date_onset, date_hospitalisation, and gender to beginning
linelist %>%
select(case_id, date_onset, date_hospitalisation, gender, everything()) %>%
names()
## [1] "case_id" "date_onset" "date_hospitalisation"
## [4] "gender" "row_num" "generation"
## [7] "date_infection" "date_outcome" "outcome"
## [10] "hospital" "lon" "lat"
## [13] "infector" "source" "age"
## [16] "age_unit" "fever" "chills"
## [19] "cough" "aches" "vomit"As well as everything() there are several special functions that work within select(), namely:
everything() - all other variables not mentionedlast_col() - the last columnstarts_with() - matches to a specified prefix. Example: select(starts_with("date"))ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))contains() - variables containing a character string. Example: select(contains("time"))matches() - to apply a regular expression (regex). Example: select(contains("[pt]al"))num_range() -any_of() - matches if variable is named. Useful if name might not exist. Example: select(any_of(date_onset, date_death, cardiac_arrest))where() - applies a function to all variables and selects those which are TRUEselect()to the cleaning pipe chain:In the linelist, there is one variable we do not need: row_num. Remove it by adding a select() command to the cleaning pipe chain:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
# standardize variable name syntax
janitor::clean_names() %>%
# manually re-name variables
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-row_num)CAUTION: This tab may follow from previous tabs.
See section on object classes
Here we want to ensure that the class of each variable is appropriate. First we run some checks on the classes of important variables.
The class of the “age” variable is character. To perform analysis, we need those numbers to be recognized as numeric!
The class of the “date_onset” variable is also character! To perform analysis, these dates must be recognized as dates!
Use table() or another method to see all the values, can see that we see that one date was entered in a different format (15 April 2014) than all the others!
##
## 15 April 2014 2010-05-07 2010-05-08 2010-05-27 2010-06-15
## 1 1 1 1 1
## 2010-06-18
## 1
This means before we can classify “date_onset” as a date, this value must be fixed to be the same format as the others. You can fix the date in the source data. Or, we can do this using mutate() and recode() in our cleaning pipe chain, before the commands to convert to class Date. LINK TO CLASSIFYING VARIABLE AS DATE.
The new mutate line can be read as: mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE. Note that this pattern (OLD = NEW) is the opposite of most R patterns. The R development community is working on revising this.
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
# standardize variable name syntax
janitor::clean_names() %>%
# manually re-name variables
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-row_num) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
# fix incorrect values # old value # new value
mutate(date_onset = recode(date_onset, "15 April 2014" = "2014-04-15")) %>%
# correct the class of the variables
mutate(age = as.numeric(age),
date_onset = as.Date(date_onset, format = "%Y-%m-%d"))Especially after converting to class date, check your data visually or with table() to confirm that they were converted correctly! For as.Date(), the format = argument is often a source of errors.
class(linelist$date_infection)
## [1] "POSIXct" "POSIXt"
head(linelist$date_infection)
## [1] "2014-04-09 UTC" NA NA "2014-05-07 UTC"
## [5] NA "2014-05-06 UTC"You can use The dplyr function across() with mutate() to convert several variables at once to a new class. across() allows you to specify which columns you want a function to apply to. Below, we want to mutate the variables where is.POSIXct() (a type of date/time class that shows unnecessary timestamps) is TRUE, and apply the function is.Date() to them, in order to convert them to class “date”.
across() we also use the function where().is.character(), is.numeric(), and is.logical()) are from base Racross() are written without the empty parentheses ()CAUTION: This tab may follow from previous tabs.
After selecting columns, a typical cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept. A simple example below filters to keep only the rows where case_id is not missing (three rows are removed):
A more complex example:
Run a simple histogram of onset dates to see that a second smaller outbreak from 2012-2013 is also included in this dataset. For our analyses, we want to remove entries from this earlier outbreak.
If we simply filter linelist by date of onset (after June 2013) we may make a mistake! Applying filter(date_onset > as.Date("2013-06-01"))) would accidentally remove any observations in the later epidemic with a missing date of onset!
DANGER: Filtering to greater than (>) or less than (<) a date can remove any observations with missing date values (NA)! This is because NA is treated as infinitely large and small.
We also know that this first epidemic occurred at Hospital A, Hospital B, and there were 10 cases at Connaught Hospital. Hospitals A & B did not have cases in the second epidemic, but Connaught Hospital had many. This is a complex filter to apply - it is wise to tabulate these variables to know exactly how many rows we expect should be removed.
Let’s examine a cross-tabulation to make sure we exclude only the correct rows:
table(Hospital = linelist$hospital, # hospital name
YearOnset = lubridate::year(linelist$date_onset), # year of the date_onset
useNA = "always") # show missing values
## YearOnset
## Hospital 2010 2011 2014 2015 <NA>
## Connaught Hopital 0 0 38 6 3
## Connaught Hospital 8 2 1322 327 66
## Hospital A 40 13 0 0 1
## Hospital B 41 12 0 0 1
## Military Hopital 0 0 20 8 2
## Military Hospital 0 0 580 168 38
## Mitylira Hopital 0 0 1 0 0
## Mitylira Hospital 0 0 59 17 3
## other 0 0 679 168 38
## Princess Christian Maternity Hopital (PCMH) 0 0 10 1 0
## Princess Christian Maternity Hospital (PCMH) 0 0 306 90 15
## Rokupa Hopital 0 0 10 1 0
## Rokupa Hospital 0 0 332 94 17
## <NA> 0 0 1106 296 67We want to exclude only the nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01"))) observations from 2012 and 2013 at those three hospitals (A, B, and Connaught), including the 2 from Hospitals A & B with missing onset dates, but not any others with missing onset dates. We start with a linelist of nrow(linelist). Here is our statement:
linelist <- linelist %>%
filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))
nrow(linelist)
## [1] 5888When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, the 10 Connaught Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.
table(Hospital = linelist$hospital, # hospital name
YearOnset = lubridate::year(linelist$date_onset), # year of the date_onset
useNA = "always") # show missing values
## YearOnset
## Hospital 2014 2015 <NA>
## Connaught Hopital 38 6 3
## Connaught Hospital 1322 327 66
## Military Hopital 20 8 2
## Military Hospital 580 168 38
## Mitylira Hopital 1 0 0
## Mitylira Hospital 59 17 3
## other 679 168 38
## Princess Christian Maternity Hopital (PCMH) 10 1 0
## Princess Christian Maternity Hospital (PCMH) 306 90 15
## Rokupa Hopital 10 1 0
## Rokupa Hospital 332 94 17
## <NA> 1106 296 67Multiple filter statements can be separated by commas, or you can always pipe to a separate filter() statement for clarity. Adding these filter statements to the cleaning pipe chain now looks like this:
# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
# standardize variable name syntax
janitor::clean_names() %>%
# manually re-name variables
# NEW name # OLD name
rename(date_infection = infection_date,
date_hospitalisation = hosp_date,
date_outcome = date_of_outcome) %>%
# remove column
select(-row_num) %>%
# fix incorrect values # old value # new value
mutate(date_onset = recode(date_onset, "15 April 2014" = "2014-04-15")) %>%
# correct the class of the variables
mutate(age = as.numeric(age),
date_onset = as.Date(date_onset, format = "%Y-%m-%d")) %>%
# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
###################################################
filter(!is.na(case_id), # keep only rows where case_id is not missing
date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))
) # close filterfilter(dataset, criteria) OR subset like: dataset_new <- dataset[criteria,criteria]
nrow(linelist %>% filter())
CAUTION: This tab may follow from previous tabs.
We advise creating new variables with dplyr functions as part of a chain of such verb functions (e.g. filter, mutate, etc.)
If in need of a stand-alone command, you can use the base R style to create a new variable.
As explained in the section on dplyr and tidyverse coding style (LINK HERE), a chain of ‘verb’ functions operate on a dataset through ‘pipes’ (%>%), passing the output from one verb to the next. The verb mutate() used to add a new variable or modify an existing one.
Below are some example of creating new variables with mutate(). The syntax is: new_variable_name = value or function. It is best practice to separate each new variable with a comma and new line.
linelist <- linelist %>% # creating new, or modifying old dataset
mutate(new_var_dup = case_id, # new variable = duplicate/copy another variable
new_var_static = 7, # new variable = all values the same
new_var_static = new_var_static + 5, # you can overwrite a variable, and can modify a variable multiple times
new_var_calc = (age / 12), # new variable = a calculation
new_var_paste = paste0(hospital, " (", date_hospitalisation, ")") # new variable = pasting together values from other variables
) Scroll to the right to see the new variables:
# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )TIP: The verb transmute() adds new variables just like mutate() but also drops/removes all other variables that you do not mention.
To recode the values in a variable, mutate() is also used.
In this case, we notice that we need to clean the variable “hospital”. There are several incorrect spelling, and many missing values.
table(linelist$hospital, useNA = "always")
##
## Connaught Hopital
## 47
## Connaught Hospital
## 1715
## Military Hopital
## 30
## Military Hospital
## 786
## Mitylira Hopital
## 1
## Mitylira Hospital
## 79
## other
## 885
## Princess Christian Maternity Hopital (PCMH)
## 11
## Princess Christian Maternity Hospital (PCMH)
## 411
## Rokupa Hopital
## 11
## Rokupa Hospital
## 443
## <NA>
## 1469For to change spellings one-by-one, you can use the recode() function within the mutate function. The code is saying that the variable “hospital” should be defined as the current variable “hospital” but with certain changes (syntax is OLD = NEW). Don’t forget commas!
linelist <- linelist %>%
mutate(hospital = recode(hospital,
# OLD = NEW
"Mitylira Hopital" = "Military Hospital",
"Mitylira Hospital" = "Military Hospital",
"Military Hopital" = "Military Hospital",
"Connaught Hopital" = "Connaught Hospital",
"Rokupa Hopital" = "Rokupa Hospital",
"other" = "Other",
"Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
))
table(linelist$hospital, useNA = "always")
##
## Connaught Hospital
## 1762
## Military Hospital
## 896
## Other
## 885
## Princess Christian Maternity Hospital (PCMH)
## 422
## Rokupa Hospital
## 454
## <NA>
## 1469TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.
TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space inbetween ("").
If you need to use logic statements with multiple variables, or want to use operators like %in%, use case_when() instead, as below. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic!
linelist <- linelist %>%
mutate(hospital = case_when(hospital == "Connaught Hopital" ~ "Connaught Hospital",
hospital == "Rokupa Hopital" ~ "Rokupa Hospital",
hospital %in% c("Mitylira Hopital",
"Mitylira Hospital",
"Mitylira Hospital",
"Military Hopital") ~ "Military Hospital",
is.na(hospital) ~ "Missing",
hospital == "Princess Christian Maternity Hopital (PCMH)" ~ "Princess Christian Maternity Hospital (PCMH)",
TRUE ~ hospital)
)
table(linelist$hospital, useNA = "always")
##
## Connaught Hospital
## 1762
## Military Hospital
## 896
## Missing
## 1469
## Other
## 885
## Princess Christian Maternity Hospital (PCMH)
## 422
## Rokupa Hospital
## 454
## <NA>
## 0To change missing values to a character value, such as “Missing”, use the function replace_na() in the same manner as recode above:
Likewise you can quickly convert character values to NA using na_if(), as below:
Intro For simple cases you can use ifelse() or if_else(). In most cases it is better to use case_when().
ifelse() and if_else():
These commands are simplified versions of an if and else statement. The general syntax is ifelse(condition, value if TRUE, value if FALSE). if_else() is a special version from dplyr that handles dates.
Stringing together ifelse statements - NOT ADVISED!! Difficult to read and keep track of.
IMAGE of ifelse string with X across is.
Use case-when() instead.
You can reference other variables with the ifelse() function within mutate():
Missing if… na_if() lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),
coalesce()
if_else(), ifelse()
recode CLEANING MISSPELLINGS HOSPITAL NAME
Replace
CAUTION: This tab may follow from previous tabs.
## load cleaning rules and only keep variables in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
filter(variable %in% c(names(mll_raw), ".global"))
## define variables that are not cleand
unchanged <- c(
"epilink_relationship",
"narratives",
"epilink_relationship_detail"
)
mll_clean <- mll_raw %>%
## convert to tibble
as_tibble() %>%
## clean variables using cleaning rules
clean_data(
wordlists = mll_cleaning_rules,
protect = names(.) %in% unchanged
)CAUTION: This tab may follow from previous tabs.
Using mutate on GROUPED dataframes https://dplyr.tidyverse.org/reference/mutate.html
Taken from website above:
Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:
starwars %>%
select(name, mass, species) %>%
mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:
starwars %>%
select(name, mass, species) %>%
group_by(species) %>%
mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.
If you need to write a stand-alone command using base R (e.g. not part of a chain of dplyr verbs), then you can create a new variable by assigning it a value. In the command below, the variable new_var does not exist until after the command is executed. In this simple example the variable is assigned the static value “new value”, so for all rows the value will be “new value”.
You can also give the new variable a dyanmic value as shown below, or using the case_when() command explained in the next tab.
case_when())CAUTION: This tab may follow from previous tabs.
TODO tutorial on using case_when()
For example, creating age groups cut()
case_when()
age_categories() (R4Epis package)
by percentile
WHAT TO DO IF AGE IS SPREAD ACROSS TWO VARAIBLES (e.g. numeric age + unit)
CAUTION: This tab may follow from previous tabs.
Within a group, indicate/convert to the highest value in the group
Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)